Author: Colby Huang
Data sourced from https://www.kaggle.com/rtatman/188-million-us-wildfires
The dataset used for this case study is a database of about 1.88 million wildfires in the United States from 1992 to 2015. This dataset interests me because my family lives in California, where the last few years have brought many fierce fire seasons and summer skies have often been obscured by smoke. My hobby is astrophotography, which demands clear night skies, so the frustration of having nights ruined by smoky haze despite clear weather piqued my interest in these wildfires and in whether fire seasons have always been like this.

Source: firemap.sdsc.edu
The above is a timelapse of smoke from the August Complex, a 2020 wildfire complex that burned over a million acres and remains the largest wildfire in California history. While this dataset doesn't contain data that recent, studying it may still reveal useful trends.
# some imports
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
Let's import our dataset, starting with the Fires table. The dataset is split into 24 CSV files in /data, one for each year from 1992 to 2015 inclusive. We'll load each year's data and concatenate it with the rest. Depending on your system, this may take a short while. We should end up with 1880465 rows and 38 columns minus the ones we drop.
# df_temp = pd.read_csv("data/Fires_1992.csv", low_memory=False)
# df = df_temp.copy()
# for i in range(1993, 2016):
#     df_temp = pd.read_csv(f"data/Fires_{i}.csv", low_memory=False)
#     df = pd.concat([df, df_temp])
### The lines above load from local files; the lines below load from the online repo. Comment/uncomment to switch. ###
df_temp = pd.read_csv("https://raw.githubusercontent.com/colbyyh2/wildfire92-15_eda/main/data/Fires_1992.csv", low_memory=False)
df = df_temp.copy()
for i in range(1993, 2016):
    df_temp = pd.read_csv(f"https://raw.githubusercontent.com/colbyyh2/wildfire92-15_eda/main/data/Fires_{i}.csv", low_memory=False)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df = pd.concat([df, df_temp])
del df_temp
# Drop identifier and code columns that the analysis does not use
df.drop(columns=[
    'OBJECTID',
    'FPA_ID',
    'SOURCE_SYSTEM',
    'SOURCE_REPORTING_UNIT',
    'SOURCE_REPORTING_UNIT_NAME',
    'STAT_CAUSE_CODE',
    'OWNER_CODE',
    'FIRE_CODE',
    'COUNTY',
    'FIPS_CODE'
    ],
    inplace=True)
num_rows, num_cols = df.shape
print(f"The Fires table has {num_rows} rows and {num_cols} columns.")
The Fires table has 1880465 rows and 28 columns.
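The loading loop grows the DataFrame one year at a time, which recopies the accumulated rows on every iteration. A minimal sketch of an alternative is to read every file into a list first and concatenate once. The `toy_csvs` data and the `load_years` helper below are hypothetical stand-ins for illustration, not part of the dataset or the original notebook.

```python
import io
import pandas as pd

# Toy stand-ins for the yearly CSV files (made-up rows, not from the dataset).
toy_csvs = {
    1992: "FOD_ID,FIRE_YEAR,FIRE_SIZE\n1,1992,0.1\n2,1992,5.0\n",
    1993: "FOD_ID,FIRE_YEAR,FIRE_SIZE\n3,1993,2.5\n",
}

def load_years(sources):
    """Read each yearly CSV, then concatenate all of them in a single call."""
    frames = [pd.read_csv(io.StringIO(text)) for _, text in sorted(sources.items())]
    # ignore_index=True builds a fresh 0..n-1 index instead of repeating each file's index
    return pd.concat(frames, ignore_index=True)

toy_df = load_years(toy_csvs)
print(toy_df.shape)
```

With the real files, the list comprehension would call `pd.read_csv` on each path or URL instead of on an in-memory string. Note that `ignore_index=True` changes the row labels, so the index would no longer restart at 0 for each year as it does in the outputs shown in this notebook.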
Let's examine the dataset. A description of each column can be found at https://www.kaggle.com/rtatman/188-million-us-wildfires. The DISCOVERY_DATE and CONT_DATE columns have been reformatted as date strings (yyyy-mm-dd).
pd.set_option('display.max_columns', 100)
df.head()
| | FOD_ID | SOURCE_SYSTEM_TYPE | NWCG_REPORTING_AGENCY | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | LOCAL_FIRE_REPORT_ID | LOCAL_INCIDENT_ID | FIRE_NAME | ICS_209_INCIDENT_NUMBER | ICS_209_NAME | MTBS_ID | MTBS_FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | STAT_CAUSE_DESCR | CONT_DATE | CONT_DOY | CONT_TIME | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_DESCR | STATE | FIPS_NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42087 | FED | FS | USMTBDF | Beaverhead/Deerlodge National Forest | 202 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1992 | 1992-05-19 | 140 | 1420.0 | Lightning | 1992-05-19 | 140.0 | 1700.0 | 1.0 | B | 45.360000 | -113.078333 | OTHER FEDERAL | MT | NaN |
| 1 | 42088 | FED | FS | USMTBDF | Beaverhead/Deerlodge National Forest | 212 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1992 | 1992-08-05 | 218 | 1700.0 | Lightning | 1992-08-06 | 219.0 | 1600.0 | 0.6 | B | 44.540000 | -112.683333 | USFS | MT | NaN |
| 2 | 42089 | FED | FS | USMTBDF | Beaverhead/Deerlodge National Forest | 221 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1992 | 1992-10-05 | 279 | 1400.0 | Campfire | 1992-10-07 | 281.0 | 1630.0 | 0.1 | A | 44.516667 | -112.983333 | USFS | MT | NaN |
| 3 | 42090 | FED | FS | USMTBDF | Beaverhead/Deerlodge National Forest | 222 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1992 | 1992-10-24 | 298 | 1515.0 | Campfire | 1992-10-24 | 298.0 | 1525.0 | 0.1 | A | 44.690000 | -112.730000 | USFS | MT | NaN |
| 4 | 42091 | FED | FS | USMTBDF | Beaverhead/Deerlodge National Forest | 204 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1992 | 1992-06-07 | 159 | 1000.0 | Campfire | 1992-06-07 | 159.0 | 1030.0 | 0.1 | A | 45.763333 | -112.820000 | STATE OR PRIVATE | MT | NaN |
df.tail()
| | FOD_ID | SOURCE_SYSTEM_TYPE | NWCG_REPORTING_AGENCY | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | LOCAL_FIRE_REPORT_ID | LOCAL_INCIDENT_ID | FIRE_NAME | ICS_209_INCIDENT_NUMBER | ICS_209_NAME | MTBS_ID | MTBS_FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | STAT_CAUSE_DESCR | CONT_DATE | CONT_DOY | CONT_TIME | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_DESCR | STATE | FIPS_NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 74486 | 300348363 | NONFED | ST/C&L | USCASHU | Shasta-Trinity Unit | 591814.0 | 009371 | ODESSA 2 | NaN | NaN | NaN | NaN | NaN | 2015 | 2015-09-26 | 269 | 1726.0 | Missing/Undefined | 2015-09-26 | 269.0 | 1843.0 | 0.01 | A | 40.481637 | -122.389375 | STATE OR PRIVATE | CA | NaN |
| 74487 | 300348373 | NONFED | ST/C&L | USCATCU | Tuolumne-Calaveras Unit | 569419.0 | 000366 | NaN | NaN | NaN | NaN | NaN | NaN | 2015 | 2015-10-05 | 278 | 126.0 | Miscellaneous | NaN | NaN | NaN | 0.20 | A | 37.617619 | -120.938570 | MUNICIPAL/LOCAL | CA | NaN |
| 74488 | 300348375 | NONFED | ST/C&L | USCATCU | Tuolumne-Calaveras Unit | 574245.0 | 000158 | NaN | NaN | NaN | NaN | NaN | NaN | 2015 | 2015-05-02 | 122 | 2052.0 | Missing/Undefined | NaN | NaN | NaN | 0.10 | A | 37.617619 | -120.938570 | MUNICIPAL/LOCAL | CA | NaN |
| 74489 | 300348377 | NONFED | ST/C&L | USCATCU | Tuolumne-Calaveras Unit | 570462.0 | 000380 | NaN | NaN | NaN | NaN | NaN | NaN | 2015 | 2015-10-14 | 287 | 2309.0 | Missing/Undefined | NaN | NaN | NaN | 2.00 | B | 37.672235 | -120.898356 | MUNICIPAL/LOCAL | CA | NaN |
| 74490 | 300348399 | NONFED | ST/C&L | USCABDU | San Bernardino Unit | 535436.0 | 003225 | BARKER BL BIG_BEAR_LAKE_ | NaN | NaN | NaN | NaN | NaN | 2015 | 2015-03-14 | 73 | 2128.0 | Miscellaneous | NaN | NaN | NaN | 0.10 | A | 34.263217 | -116.830950 | STATE OR PRIVATE | CA | NaN |
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1880465 entries, 0 to 74490
Data columns (total 28 columns):
 #   Column                    Dtype
---  ------                    -----
 0   FOD_ID                    int64
 1   SOURCE_SYSTEM_TYPE        object
 2   NWCG_REPORTING_AGENCY     object
 3   NWCG_REPORTING_UNIT_ID    object
 4   NWCG_REPORTING_UNIT_NAME  object
 5   LOCAL_FIRE_REPORT_ID      object
 6   LOCAL_INCIDENT_ID         object
 7   FIRE_NAME                 object
 8   ICS_209_INCIDENT_NUMBER   object
 9   ICS_209_NAME              object
 10  MTBS_ID                   object
 11  MTBS_FIRE_NAME            object
 12  COMPLEX_NAME              object
 13  FIRE_YEAR                 int64
 14  DISCOVERY_DATE            object
 15  DISCOVERY_DOY             int64
 16  DISCOVERY_TIME            float64
 17  STAT_CAUSE_DESCR          object
 18  CONT_DATE                 object
 19  CONT_DOY                  float64
 20  CONT_TIME                 float64
 21  FIRE_SIZE                 float64
 22  FIRE_SIZE_CLASS           object
 23  LATITUDE                  float64
 24  LONGITUDE                 float64
 25  OWNER_DESCR               object
 26  STATE                     object
 27  FIPS_NAME                 object
dtypes: float64(6), int64(3), object(19)
memory usage: 416.1+ MB
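The info output lists DISCOVERY_DATE and CONT_DATE with dtype `object`, meaning they are stored as plain strings. If date arithmetic is needed later, one possible approach is to parse them with `pd.to_datetime`. The sketch below uses two date pairs copied from the rows displayed above; the `DAYS_TO_CONTAIN` column name is an assumption introduced only for illustration.

```python
import pandas as pd

# Two date pairs taken from the displayed rows; the second fire has no containment date.
toy = pd.DataFrame({
    "DISCOVERY_DATE": ["1992-08-05", "2015-10-05"],
    "CONT_DATE": ["1992-08-06", None],
})

for col in ["DISCOVERY_DATE", "CONT_DATE"]:
    # errors="coerce" turns unparseable or missing values into NaT instead of raising
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d", errors="coerce")

# Hypothetical derived column: whole days between discovery and containment
toy["DAYS_TO_CONTAIN"] = (toy["CONT_DATE"] - toy["DISCOVERY_DATE"]).dt.days
print(toy)
```

Rows without a containment date end up with a missing duration, so any later aggregate over this column would silently skip them.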
df_small_sample=df.sample(n=150)
df_small_sample.head(10)
| | FOD_ID | SOURCE_SYSTEM_TYPE | NWCG_REPORTING_AGENCY | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | LOCAL_FIRE_REPORT_ID | LOCAL_INCIDENT_ID | FIRE_NAME | ICS_209_INCIDENT_NUMBER | ICS_209_NAME | MTBS_ID | MTBS_FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | STAT_CAUSE_DESCR | CONT_DATE | CONT_DOY | CONT_TIME | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_DESCR | STATE | FIPS_NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1507 | 106665 | FED | FS | USMTBDF | Beaverhead/Deerlodge National Forest | 31.0 | NaN | WAUKENA | NaN | NaN | NaN | NaN | NaN | 1998 | 1998-08-15 | 227 | 1500.0 | Lightning | 1998-08-16 | 228.0 | 800.0 | 5.00 | B | 45.548333 | -112.893333 | USFS | MT | NaN |
| 95307 | 19969762 | NONFED | ST/C&L | USWIWIS | Wisconsin Department of Natural Resources | NaN | 43 | NaN | NaN | NaN | NaN | NaN | NaN | 2000 | 2000-03-26 | 86 | NaN | Debris Burning | NaN | NaN | NaN | 1.20 | B | 45.053536 | -90.215023 | MISSING/NOT SPECIFIED | WI | Taylor |
| 29446 | 201611237 | NONFED | ST/C&L | USFLFLS | Florida Forest Service | NaN | 2012060219 | PREVATT CREEK (04) | NaN | NaN | NaN | NaN | NaN | 2012 | 2012-02-04 | 35 | 1620.0 | Debris Burning | 2012-02-04 | 35.0 | 1750.0 | 3.00 | B | 29.909100 | -82.131800 | PRIVATE | FL | Bradford |
| 42323 | 691621 | NONFED | ST/C&L | USTXTXS | Texas A & M Forest Service | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008 | 2008-01-12 | 12 | NaN | Debris Burning | NaN | NaN | NaN | 0.50 | B | 32.142300 | -96.241260 | MISSING/NOT SPECIFIED | TX | Navarro |
| 42742 | 692346 | NONFED | ST/C&L | USTXTXS | Texas A & M Forest Service | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008 | 2008-01-29 | 29 | NaN | Debris Burning | NaN | NaN | NaN | 2.00 | B | 32.297980 | -95.435870 | MISSING/NOT SPECIFIED | TX | Smith |
| 64685 | 19090648 | NONFED | ST/C&L | USGAGAS | Georgia Forestry Commission | NaN | 76 | NaN | NaN | NaN | NaN | NaN | NaN | 1992 | 1992-02-11 | 42 | 1325.0 | Debris Burning | 1992-02-11 | 42.0 | 1346.0 | 1.05 | B | 34.841600 | -85.531900 | PRIVATE | GA | Dade |
| 64484 | 1075609 | NONFED | ST/C&L | USVAVAS | Virginia Department of Forestry | NaN | 73 | NaN | NaN | NaN | NaN | NaN | NaN | 1999 | 1999-11-16 | 320 | NaN | Smoking | NaN | NaN | NaN | 10.00 | C | 37.516700 | -79.183300 | MISSING/NOT SPECIFIED | VA | NaN |
| 38824 | 629285 | NONFED | ST/C&L | USSCSCS | South Carolina Forestry Commission | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008 | 2008-06-12 | 164 | NaN | Debris Burning | NaN | NaN | NaN | 0.50 | B | 34.050710 | -80.520590 | MISSING/NOT SPECIFIED | SC | Sumter |
| 80323 | 1822340 | NONFED | ST/C&L | USNYNYX | Fire Department of New York | NaN | NY5822-2005-052219 | NaN | NaN | NaN | NaN | NaN | NaN | 2005 | 2005-08-07 | 219 | 1129.0 | Debris Burning | 2005-08-07 | 219.0 | 1129.0 | 1.00 | B | 43.274688 | -73.397136 | MISSING/NOT SPECIFIED | NY | Washington |
| 36821 | 1022595 | NONFED | ST/C&L | USLALAS | Louisiana Office of Forestry | NaN | LA3-X3 | NaN | NaN | NaN | NaN | NaN | NaN | 2002 | 2002-12-22 | 356 | NaN | Arson | NaN | NaN | NaN | 35.00 | C | 31.687200 | -92.407500 | MISSING/NOT SPECIFIED | LA | NaN |
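Many cells in the sample above are NaN, so it may help to quantify how much of each column is missing before relying on it. The sketch below shows one way to do that, using a toy frame built from three columns of the first four sampled rows; the `missing_share` name is hypothetical.

```python
import numpy as np
import pandas as pd

# Three columns from the first four sampled rows shown above
toy = pd.DataFrame({
    "FIRE_NAME": ["WAUKENA", np.nan, "PREVATT CREEK (04)", np.nan],
    "DISCOVERY_TIME": [1500.0, np.nan, 1620.0, np.nan],
    "STATE": ["MT", "WI", "FL", "TX"],
})

# isna() marks missing cells as True; the column-wise mean is the fraction missing
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on `df` would give the per-column share of missing values for the full table.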
Let's take a look at the sources of these fire reports.
The column SOURCE_SYSTEM_TYPE indicates if the record was drawn from a federal, nonfederal, or interagency database.
px.pie(df, names='SOURCE_SYSTEM_TYPE',
       title="Breakdown of database types of fire records (Federal, Non Federal, Interagency)",
       width=600, height=400)
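The proportions behind a pie chart like this can also be read off as numbers with `value_counts`. The sketch below uses a toy column with made-up counts, and only the two category labels visible in the rows displayed earlier, so the resulting shares are not the dataset's actual breakdown.

```python
import pandas as pd

# Made-up counts for illustration; FED and NONFED are labels seen in the displayed rows
toy = pd.DataFrame({"SOURCE_SYSTEM_TYPE": ["FED", "NONFED", "NONFED", "NONFED"]})

counts = toy["SOURCE_SYSTEM_TYPE"].value_counts()                # absolute record counts
shares = toy["SOURCE_SYSTEM_TYPE"].value_counts(normalize=True)  # fractions summing to 1
print(pd.DataFrame({"count": counts, "share": shares}))
```

Applying the same calls to `df['SOURCE_SYSTEM_TYPE']` would give the exact figures that the pie chart only shows visually.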